Abstract:With the emergence of wireless applications in three-dimensional environments, such as the low-altitude airspace and 3D heterogeneous networks, radio map estimation is increasingly required to characterize signal propagation across both horizontal and vertical dimensions. However, extending radio map estimation from 2D to 3D remains challenging due to increased spatial sparsity and limited supervision across continuous altitudes. In this paper, we propose \textbf{\textit{RadioFormer3D}}, a specialized model for volumetric spectrum reconstruction under weak supervision. Building on the dual-stream, multi-granularity fusion architecture of \textit{RadioFormer}, \textit{RadioFormer3D} introduces a Fourier-based sampling encoder and a volumetric decoder to efficiently process sparse measurements in 3D space. To alleviate the lack of vertical supervision, we propose the \textbf{\textit{Joint Spectrum Integrity Loss}}, which integrates volume-level pseudo-label supervision, map-level geometry-aware radio rendering, and pixel-level localized constraints within a unified optimization scheme. This design enables the model to capture complex vertical structural relationships more effectively under sparse supervision. Extensive experiments across several radio map datasets show that \textit{RadioFormer3D} achieves superior overall performance compared to representative existing methods. In particular, it demonstrates improved reconstruction quality at unlabeled altitudes while maintaining a favorable trade-off between accuracy and inference efficiency, positioning it as a highly promising solution for future 3D environment-aware wireless networks.
Abstract:Learning-based radio map estimation (RME) plays a critical role in UAV-assisted wireless sensing, enabling tasks such as coverage prediction and network optimization. Most current methods assume an independently and identically distributed (i.i.d.) training and testing setting based on random sampling. However, practical UAV measurements are collected sequentially along feasible trajectories, resulting in highly structured and spatially correlated patterns. This mismatch introduces a sampling distribution shift that increases the intrinsic difficulty of spatial field recovery and compromises the generalization of models trained under i.i.d. assumptions. To mitigate this issue, we propose a trajectory-aware training paradigm based on Stochastic-Triggered Trajectory-Based Sampling (ST-TBS), which preserves trajectory continuity while introducing sampling variability. Moreover, from a statistical perspective, we show that trajectory-based sampling reduces spatial diversity and increases information redundancy compared to random sampling. Extensive experiments on the RadioMapSeer and SpectrumNet datasets demonstrate that models trained with random sampling suffer significant performance degradation under trajectory-based observations, with RMSE increasing from 0.0391 to 0.2632 on SpectrumNet. Conversely, our proposed ST-TBS method effectively reduces the RMSE to 0.0571. These results highlight the necessity of aligning training and deployment sampling distributions for reliable RME.
Abstract:Reinforcement Learning with Verifiable Rewards (RLVR) has demonstrated notable success in enhancing the reasoning performance of large language models (LLMs). However, recent studies reveal that while current RLVR methods improve sampling efficiency towards correct paths, they do not elicit fundamentally new reasoning patterns. Instead, the reasoning capability boundary of trained models often narrows compared to their base models, with base models achieving higher coverage at large sample sizes. In this work, we propose Asymmetric Group Policy Optimization (AGPO) to counteract this boundary shrinkage. AGPO adopts a negative-dominant reinforcement strategy to suppress incorrect reasoning paths, maintaining the base model's exploration capacity. For positive reinforcement, AGPO adopts a group advantage mechanism, which scales positive updates based on intra-group variance, allowing the model to focus on rare correct paths while suppressing updates from trivial paths. Our experiments on five mathematical benchmarks demonstrate that AGPO achieves state-of-the-art accuracy while consistently improving pass@$k$ performance at scale. In a large-scale industrial application for search ads relevance optimization, AGPO effectively enhances the quality of the data annotation, leading to substantial performance gains in downstream student models.
Abstract:Jailbreak attacks on audio language models (ALMs) optimize audio perturbations to elicit unsafe generations, and they typically update the entire waveform densely throughout optimization. In this work, we investigate the necessity of such dense optimization by analyzing the structure of token-aligned gradients in ALMs. We find that gradient energy is highly non-uniform across audio tokens, indicating that only a small subset of token-aligned audio regions dominates the optimization signal. Motivated by this observation, we propose Token-Aware Gradient Optimization (TAGO), which enables sparse jailbreak optimization by retaining only waveform gradients aligned with audio tokens that have high gradient energy, while masking the remaining gradients at each iteration. Across three ALMs, TAGO outperforms baselines, and substantial sparsification preserves strong attack success rates (e.g. on Qwen3-Omni, $\mathrm{ASR}_{l}$ remains at 86% with a token retention ratio of 0.25, compared to 87% with full token retention). These results demonstrate that dense waveform updates are largely redundant, and we advocate that future audio jailbreak and safety alignment research should further leverage this heterogeneous token-level gradient structure.
Abstract:Entity alignment (EA) aims to identify entities across different knowledge graphs (KGs) that refer to the same real-world object and plays a critical role in knowledge fusion and integration. Traditional EA methods mainly rely on knowledge representation learning, but their performance is often limited under noisy or sparsely supervised scenarios. Recently, large language models (LLMs) have been introduced to EA and achieved notable improvements by leveraging rich semantic knowledge. However, existing LLM-based EA approaches typically treat LLMs as black-box decision makers, resulting in limited interpretability, and the direct use of large-scale triples substantially increases inference cost. To address these challenges, we propose \textbf{EA-Agent}, a reasoning-driven agent for EA. EA-Agent formulates EA as a structured reasoning process with multi-step planning and execution, enabling interpretable alignment decisions. Within this process, it introduces attribute and relation triple selectors to filter redundant triples before feeding them into the LLM, effectively addressing efficiency challenges. Experimental results on three benchmark datasets demonstrate that EA-Agent consistently outperforms existing EA methods and achieves state-of-the-art performance. The source code is available at https://github.com/YXNan0110/EA-Agent.
Abstract:Multi-Agent Pathfinding (MAPF) plays a critical role in various domains. Traditional MAPF methods typically assume unit edge costs and single-timestep actions, which limit their applicability to real-world scenarios. MAPFR extends MAPF to handle non-unit costs with real-valued edge costs and continuous-time actions, but its geometric collision model leads to an unbounded state space that compromises solver efficiency. In this paper, we propose MAPFZ, a novel MAPF variant on graphs with non-unit integer costs that preserves a finite state space while offering improved realism over classical MAPF. To solve MAPFZ efficiently, we develop CBS-NIC, an enhanced Conflict-Based Search framework incorporating time-interval-based conflict detection and an improved Safe Interval Path Planning (SIPP) algorithm. Additionally, we propose Bayesian Optimization for Graph Design (BOGD), a discretization method for non-unit edge costs that balances efficiency and accuracy with a sub-linear regret bound. Extensive experiments demonstrate that our approach outperforms state-of-the-art methods in runtime and success rate across diverse benchmark scenarios.
Abstract:Place recognition is crucial for loop closure detection and global localization in robotics. Although mainstream algorithms typically rely on cameras and LiDAR, these sensors are susceptible to adverse weather conditions. Fortunately, the recently developed 4D millimeter-wave radar (4D radar) offers a promising solution for all-weather place recognition. However, the inherent noise and sparsity in 4D radar data significantly limit its performance. Thus, in this paper, we propose a novel framework called 4DRaL that leverages knowledge distillation (KD) to enhance the place recognition performance of 4D radar. Its core is to adopt a high-performance LiDAR-to-LiDAR (L2L) place recognition model as a teacher to guide the training of a 4D radar-to-4D radar (R2R) place recognition model. 4DRaL comprises three key KD modules: a local image enhancement module to handle the sparsity of raw 4D radar points, a feature distribution distillation module that ensures the student model generates more discriminative features, and a response distillation module to maintain consistency in feature space between the teacher and student models. More importantly, 4DRaL can also be trained for 4D radar-to-LiDAR (R2L) place recognition through different module configurations. Experimental results prove that 4DRaL achieves state-of-the-art performance in both R2R and R2L tasks regardless of normal or adverse weather.
Abstract:Surgical scene Multi-Task Federated Learning (MTFL) is essential for robot-assisted minimally invasive surgery (RAS) but remains underexplored in surgical video understanding due to two key challenges: (1) Tissue Diversity: Local models struggle to adapt to site-specific tissue features, limiting their effectiveness in heterogeneous clinical environments and leading to poor local predictions. (2) Task Diversity: Server-side aggregation, relying solely on gradient-based clustering, often produces suboptimal or incorrect parameter updates due to inter-site task heterogeneity, resulting in inaccurate localization. In light of these two issues, we propose SurgFed, a multi-task federated learning framework, enabling federated learning for surgical scene segmentation and depth estimation across diverse surgical types. SurgFed is powered by two appealing designs, i.e., Language-guided Channel Selection (LCS) and Language-guided Hyper Aggregation (LHA), to address the challenge of fully exploration on corss-site and cross-task. Technically, the LCS is first designed a lightweight personalized channel selection network that enhances site-specific adaptation using pre-defined text inputs, which optimally the local model learn the specific embeddings. We further introduce the LHA that employs a layer-wise cross-attention mechanism with pre-defined text inputs to model task interactions across sites and guide a hypernetwork for personalized parameter updates. Extensive empirical evidence shows that SurgFed yields improvements over the state-of-the-art methods in five public datasets across four surgical types. The code is available at https://anonymous.4open.science/r/SurgFed-070E/.
Abstract:Adversarial transferability refers to the capacity of adversarial examples generated on the surrogate model to deceive alternate, unexposed victim models. This property eliminates the need for direct access to the victim model during an attack, thereby raising considerable security concerns in practical applications and attracting substantial research attention recently. In this work, we discern a lack of a standardized framework and criteria for evaluating transfer-based attacks, leading to potentially biased assessments of existing approaches. To rectify this gap, we have conducted an exhaustive review of hundreds of related works, organizing various transfer-based attacks into six distinct categories. Subsequently, we propose a comprehensive framework designed to serve as a benchmark for evaluating these attacks. In addition, we delineate common strategies that enhance adversarial transferability and highlight prevalent issues that could lead to unfair comparisons. Finally, we provide a brief review of transfer-based attacks beyond image classification.
Abstract:Tool learning is increasingly important for large language models (LLMs) to effectively coordinate and utilize a diverse set of tools in order to solve complex real-world tasks. By selecting and integrating appropriate tools, LLMs extend their capabilities beyond pure language understanding to perform specialized functions. However, existing methods for tool selection often focus on limited tool sets and struggle to generalize to novel tools encountered in practical deployments. To address these challenges, we introduce a comprehensive dataset spanning 7 domains, containing 155 tools and 9,377 question-answer pairs, which simulates realistic integration scenarios. Additionally, we propose MetaToolAgent (MTA), a meta-learning approach designed to improve cross-tool generalization. Experimental results show that MTA significantly outperforms baseline methods on unseen tools, demonstrating its promise for building flexible and scalable systems that require dynamic tool coordination.